{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> Using the Fred2 generators </h1>\n",
    "\n",
    "This tutorial illustrates how to benefit from `Fred2` generators in your immunoinformatic scripts.\n",
    "\n",
    "It is assumed that you know about basic python functionality and have made yourself familiar with `Fred2` through the other `Fred2` ipython notebooks.\n",
    "\n",
    "Let's start by importing the neccessary modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from Fred2.Core import Variant, Transcript, Peptide, Protein\n",
    "from Fred2.Core import generate_transcripts_from_variants\n",
    "from Fred2.Core import generate_proteins_from_transcripts\n",
    "from Fred2.Core import generate_peptides_from_proteins\n",
    "from Fred2.Core import generate_peptides_from_variants\n",
    "from Fred2.IO import read_annovar_exonic\n",
    "from Fred2.IO.MartsAdapter import MartsAdapter\n",
    "from Fred2.IO.ADBAdapter import EIdentifierTypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Chapter 1: Generators in python </h2>\n",
    "<br/>\n",
    "We will first revisit the concept of generators in python before laying out the benefits of generators in immunoinformatic applications.\n",
    "Python generator functions behave like iterators, i.e. they can be used in a for loop. But contrary to iterators, the elements to be iterated over do not have to reside in memory all at once.\n",
    "\n",
    "As an exaggarated illustration, imagine a class of objects very rich in unique information, therefore taking up a really big chunk of space in memory, say 1Mb each. If you have a lot of them and want to access them in series, you can put them in a list and iterate over them. But putting them in a list, implies they have to exist all at once, making this list already taking up 1Gb of memory if you have only 1000 objects.\n",
    "\n",
    "But if you only have a temporary interest in most of these objects *and* you can create them dynamically, this is an ideal case to save some memory with python generators. They will generate the objects on-the-go (and you can of course retain interesting ones).\n",
    "\n",
    "What makes generators so powerful and flexible is, that the code for iterating a generator is the same as iterating a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "list:\n",
      "1\n",
      "3\n",
      "5\n",
      "7\n",
      "9\n",
      "generator:\n",
      "1\n",
      "3\n",
      "5\n",
      "7\n",
      "9\n"
     ]
    }
   ],
   "source": [
    "def print_odd_numbers(numbers):\n",
    "    for i in numbers:\n",
    "        if i%2 != 0:\n",
    "            print i\n",
    "            \n",
    "nums1 = [1,2,3,4,5,6,7,8,9,10] \n",
    "nums2 = xrange(11)\n",
    "print \"list:\"\n",
    "print_odd_numbers(nums1)\n",
    "print \"generator:\"\n",
    "print_odd_numbers(nums2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the function **`print_odd_numbers`** works with both *lists* and *generators* (<a href=\"https://docs.python.org/2/library/functions.html#xrange\">xrange</a> is a python generator that will produce all integer numbers in the given range). \n",
    "\n",
    "Note, that in generators, the current object is discarded after you iterate to the next. This way, you do not waste any memory. **But** this also means, you cannot 'reuse' **`nums2`** before you 'reinitialize' the generator. Also, you cannot random-access your objects (e.g. **`my_list[123]`**)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Chapter 2: Epitope prediction </h2>\n",
    "<br/>\n",
    "In `Fred2` we are dealing with a lot of sequences of different kinds. We take transcript sequences and integrate mutational variants and consider heterozygosities. We generate individualized protein sequences, from which we will slice peptides and calculate their immunological properties. And we are only interested in a specific set of them, say the predicted binders to a certain MHC molecule.\n",
    "This is an ideal usecase for python generators, as we \n",
    "  * can generate our objects on-the-go,\n",
    "  * are only interested in a few and \n",
    "  * have to deal with a whole lot of them and want to preserve memory.\n",
    "\n",
    "So, for an concrete example, if we are only interested in polymorphic peptides, the high number of heterozygous variants prohibits the construction of all polymorphic transcripts at once. To worsen the situation we do not want to keep track of all resulting proteins but those producing immunological interesting peptides.\n",
    "The combinatorical explosion prohibits such brute-force approaches even on medium sized sets of sequences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Fred2, we therefore have prepared a python generator solution for the `Fred2` objects of <a href=\"http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Variant\">variants</a>, <a href=\"http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Transcript\">transcripts</a>,<a href=\"http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Protein\">proteins</a> and <a href=\"http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.Core.Peptide\">peptides</a> in the `Fred2.Core` module. The following chapter will introduce the usage of the `Fred2` basic generators.\n",
    "\n",
    "We will start with the **`generate_transcripts_from_variants`** function. To have a small number of variants to show, how the generator works we will use an excerpt of an annovar output ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Variant(g.67705958G>A), Variant(g.234183368A>G), Variant(g.20763686G>-)]\n"
     ]
    }
   ],
   "source": [
    "vars = read_annovar_exonic(\"data/annovar_excerpt.out\")\n",
    "print vars"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generator will take a list of variants, which must have some form of sequence identifier in their **`coding`** field, designating on which sequence they do denote a variation on. It also takes a <a href=\"http://fred2.readthedocs.org/en/latest/Fred2.Core.html#module-Fred2.IO.DBAdapter\">DBAdapter</a> to retrieve the sequences to the latter identifiers and an **`EIdentifierType`**, to indicate the type of identifier for the **`DBAdapter`**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'NM_144701': <Fred2.Core.Variant.MutationSyntax instance at 0x7f42caebea70>}\n"
     ]
    }
   ],
   "source": [
    "print vars[0].coding #show a variants coding\n",
    "mart = MartsAdapter(biomart=\"http://www.ensembl.org\")\n",
    "trans = generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can simply iterate over our transcripts as they are created on-the-go.\n",
    "       "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NM_004004:FRED2_0 681\n",
      "NM_004004:FRED2_1 680\n",
      "NM_001190267:FRED2_0 1824\n",
      "NM_001190267:FRED2_1 1824\n",
      "NM_144701:FRED2_0 1890\n",
      "NM_144701:FRED2_1 1890\n",
      "NM_001190266:FRED2_0 1824\n",
      "NM_001190266:FRED2_1 1824\n",
      "NM_198890:FRED2_0 1335\n",
      "NM_198890:FRED2_1 1335\n",
      "NM_030803:FRED2_0 1824\n",
      "NM_030803:FRED2_1 1824\n",
      "NM_017974:FRED2_0 1767\n",
      "NM_017974:FRED2_1 1767\n"
     ]
    }
   ],
   "source": [
    "for t in trans:\n",
    "    print t.transcript_id, len(t)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are iterating the generator, we print the generated transcripts id and the sequence length. As you can see, the generator creates heterzygous results. This is, because our variants are registered heterozygous!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vars[0].isHomozygous"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have already iterated over our transcript iterator, so if we want to use it again, we have to reinitialize it. Then, we can use it in combination with the next generator, the **`generate_proteins_from_transcripts`** generator. This one will need nothing more than a list of transcripts (or a generator thereof)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NM_004004:FRED2_0 226\n",
      "NM_004004:FRED2_1 12\n",
      "NM_001190267:FRED2_0 607\n",
      "NM_001190267:FRED2_1 607\n",
      "NM_144701:FRED2_0 629\n",
      "NM_144701:FRED2_1 629\n",
      "NM_001190266:FRED2_0 607\n",
      "NM_001190266:FRED2_1 607\n",
      "NM_198890:FRED2_0 444\n",
      "NM_198890:FRED2_1 444\n",
      "NM_030803:FRED2_0 607\n",
      "NM_030803:FRED2_1 607\n",
      "NM_017974:FRED2_0 588\n",
      "NM_017974:FRED2_1 588\n"
     ]
    }
   ],
   "source": [
    "prots = generate_proteins_from_transcripts(generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ))\n",
    "for p in prots:\n",
    "    print p.transcript_id, len(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have created our proteins, we can slice peptides with the **`generate_peptides_from_proteins`** generator. Therefore, we have to additionally specify the length of the peptides we want to generate. Here we choose 9mers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1491 1491\n"
     ]
    }
   ],
   "source": [
    " peps = generate_peptides_from_proteins(\n",
    "          generate_proteins_from_transcripts(\n",
    "            generate_transcripts_from_variants(vars, mart, EIdentifierTypes.REFSEQ))\n",
    "        ,9)\n",
    "    \n",
    "ps  = set()\n",
    "count = 0\n",
    "for pp in peps:\n",
    "    ps.add(pp)\n",
    "    count += 1\n",
    "\n",
    "print count,len(ps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generator will create a peptide only once, even if the peptides sequence should occur more than once. The originating proteins and within transcripts are kept track of for each occurrence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PEPTIDE:\n",
       " QNILESHFN\n",
       "in TRANSCRIPT: NM_144701:FRED2_0\n",
       "\tVARIANTS:\n",
       "in TRANSCRIPT: NM_144701:FRED2_1\n",
       "\tVARIANTS:\n",
       "in PROTEIN: NM_144701:FRED2_0\n",
       "in PROTEIN: NM_144701:FRED2_1"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(ps)[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now we have prepeared an input suited for further analysis like <a href=\"EpitopePrediction.ipynb\">EpitopePrediction</a>."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
